Avoid repeated unpad/pad calls when `use_cache=False` #5
base: add-flash-attn-2
Conversation
Benchmark script for completeness: https://pastebin.com/zWE9Aedr
The changes look overall great to me! I wonder if we can add `padding_mask` inside `flash_kwargs`. To me this makes the attention forward signature a bit more complicated, but for the speedup we get I think it is worth it. Can you also confirm that `generate` with `use_cache` works fine here?
I would also like a review from @ArthurZucker before merging this.
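For illustration, a minimal sketch of what passing `padding_mask` through a `flash_kwargs` dict could look like; the function and argument names below are assumptions made for the example, not the actual transformers signatures:

```python
# Hedged sketch: bundle padding_mask into a flash_kwargs dict instead of
# widening the attention forward signature. Names (flash_kwargs,
# decoder_layer_forward, attention_forward) are illustrative assumptions.
def attention_forward(hidden_states, attention_mask=None, **flash_kwargs):
    padding_mask = flash_kwargs.get("padding_mask", None)
    # ... compute q/k/v here and dispatch to the flash-attention path,
    # using padding_mask only when it is provided ...
    return hidden_states

def decoder_layer_forward(hidden_states, attention_mask=None, padding_mask=None):
    flash_kwargs = {}
    if padding_mask is not None:
        # Only the flash-attention path needs the padding mask
        # to unpad variable-length sequences.
        flash_kwargs["padding_mask"] = padding_mask
    return attention_forward(hidden_states, attention_mask=attention_mask, **flash_kwargs)
```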
This is a draft by the way, I just wanted to get results. I'm not sure it is a very good fit for transformers though, with the modifications directly in the …
It's nice that you found a way to do this, but as you said it's not very `transformers`-like and a bit bloated, especially if it is unusable with `use_cache=True` 😢
* Cohere Model Release (#1)
* Remove unnecessary files and code (#2): some cleanup
* Delete cohere-model directory (#3)
* Make Fix (#5)
* Pr fixes (#6): fixes for pr; pr fixes for the format; src/transformers/models/auto/tokenization_auto.py
* Tokenizer test (huggingface#8): tokenizer test; format fix
* Adding Docs and other minor changes (huggingface#7)
* Add modeling tests (huggingface#9)
* Smol Fix (huggingface#11): tokenization tests are fixed; format fixes; fix pr doc tests; fix pr style check; small changes in cohere.md
* FIX: Address final comments for transformers integration (huggingface#13): fix modeling final nits and add proper test file; for now leave empty tests; add integration test; push new test
* fix modeling cohere (huggingface#14)
* Update chat templates to use the new API (huggingface#15)

Co-authored-by: ahmetustun <[email protected]>
Co-authored-by: Younes Belkada <[email protected]>
Co-authored-by: Matt <[email protected]>
As per title. The difference is quite large. This is only done out of curiosity, cc @younesbelkada
Note: Speedup over the base PR is expected only in the case of `batch_size > 1`, when padding / masked tokens are present. In the benchmark below, we use a padding percentage of 30%.
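For context, the idea behind the title is to unpad the hidden states once before the decoder stack and pad them back once after it, instead of round-tripping through unpad/pad inside every attention layer. A minimal sketch, assuming flash-attn's `bert_padding` helpers (whose exact return signature varies across flash-attn versions) and a hypothetical per-layer varlen attention call:

```python
# Hedged sketch of the "unpad once" idea; the per-layer call signature is
# a placeholder, and unpad_input's return values differ across flash-attn
# versions, so treat this as illustrative rather than the PR's exact code.
from flash_attn.bert_padding import unpad_input, pad_input

def model_forward(hidden_states, attention_mask, layers):
    batch, seqlen, hidden = hidden_states.shape

    # Unpad once: keep only the non-masked tokens as a flat (total_tokens, hidden) tensor.
    hidden_states, indices, cu_seqlens, max_seqlen = unpad_input(hidden_states, attention_mask)[:4]

    for layer in layers:
        # Each layer runs a varlen flash-attention kernel directly on the
        # unpadded tokens, so no per-layer unpad/pad round-trips are needed.
        hidden_states = layer(hidden_states, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen)

    # Pad back once at the end to restore the (batch, seqlen, hidden) layout.
    return pad_input(hidden_states, indices, batch, seqlen)
```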
This is on a single A100 for `meta-llama/Llama-2-7b-hf`.

Forward only with no_grad mode:
- batch_size=4, len=1000
- batch_size=4, len=2000
- batch_size=8, len=500
- batch_size=2, len=4000

Forward + backward:
- batch_size=4, len=1500
- batch_size=2, len=3000
- batch_size=2, len=1000
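As a rough idea of the measurement setup (the full script is in the pastebin link above), a hedged sketch of a forward-only timing loop with ~30% padded tokens; the loading flags and timing details here are assumptions, not the exact script used in this PR:

```python
# Rough benchmark sketch, assuming random token ids and ~30% padding;
# the real script used for the numbers in this PR is the pastebin link above.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # assumption: a flash-attn build is available
).to("cuda")

def run_forward(batch_size, seq_len, pad_fraction=0.3, n_iters=10):
    input_ids = torch.randint(0, model.config.vocab_size, (batch_size, seq_len), device="cuda")
    attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long, device="cuda")
    # Mask out ~30% of the tokens to emulate padding in the batch.
    attention_mask[:, : int(seq_len * pad_fraction)] = 0

    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    with torch.no_grad():
        for _ in range(n_iters):
            model(input_ids=input_ids, attention_mask=attention_mask, use_cache=False)
    end.record()
    torch.cuda.synchronize()
    print(f"bs={batch_size} len={seq_len}: {start.elapsed_time(end) / n_iters:.1f} ms / forward")

for bs, seq_len in [(4, 1000), (4, 2000), (8, 500), (2, 4000)]:
    run_forward(bs, seq_len)
```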